perm filename CHARAC.PRO[ESS,JMC]1 blob
sn#003914 filedate 1972-07-09 generic text, type T, neo UTF8
00100 ARBITRARY CHARACTER SETS
00200
00300 by John McCarthy
00400
00500
00600 It would be nice to be able to have documents stored in
00700 computers that could include arbitrary characters and to be able to
00800 display them on any CRT screen, edit them using any keyboard, and
00900 print them on any printer. The object of this memorandum is to
01000 suggest how to get there from here with special reference to the ARPA
01100 network.
01200
01300 Where are we now?
01400
01500 1. At present, there is 96-character ASCII, and everyone
01600 agrees that it should be included in any larger set.
01700
01800 2. Many installations are dependent on 64-character sets
01900 which do not even include the lower-case Latin alphabet.
02000
02100 3. At the Stanford Artificial Intelligence Laboratory, we
02200 have a 114-character set that includes 96-character ASCII and which
02300 is implemented in our keyboards, displays, and line printer.
02400
02500 4. Printers are becoming available that get their character
02600 designs out of memory, for example, the Xerox XGP printer, one of
02700 which we are getting.
02800
02900 5. The IMLAC type display has the character designs in main
03000 memory so that changing the displayed set is just a matter of
03100 reloading the memory.
03200
03300 6. Many display systems share the character generator among
03400 many display units. In some of these, e.g. the Datadisc, arbitrary
03500 sets are probably feasible (using kludgery to be described later),
03600 but in other systems, e.g. our III's, arbitrary sets are not
03700 feasible.
03800
03900 One possible approach to communication in expanded character
04000 sets is to produce an expanded standard set of characters, perhaps
04100 using 8 or 9 bits, and to expect new equipment to implement this set.
04200 This approach has the disadvantage that it will be very hard to get
04300 agreement on what the next step should be, and even if formal
04400 agreement is realized, many groups will find it in their interests to
04500 ignore the standard.
04600
04700 Therefore, I would like to suggest that the next step be a move to
04800 arbitrary character sets. I suggest implementing this in the
04900 following way:
05000
05100 1. A registry of characters should be established. Anyone can
05200 register a new character. Each character has a unique number; 17
05300 bits should be enough even to include Chinese. Besides this, each
05400 character has a name in ASCII, usually mnemonic. Finally, the
05500 character has a design which is a picture on a 50 by 50 dot matrix.
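
As an illustration only, a registry entry of the kind described in item 1 might be sketched as follows. The 17-bit number, the ASCII name, and the 50 by 50 dot matrix come from the text above; the class names, the row-as-integer encoding of the design, and the sequential assignment of numbers are assumptions made for the example.

```python
from dataclasses import dataclass

NUMBER_BITS = 17   # width of a registry number, as suggested in the memo
MATRIX_SIZE = 50   # designs are pictures on a 50 by 50 dot matrix


@dataclass(frozen=True)
class RegisteredCharacter:
    number: int    # unique registry number, 0 <= number < 2**17
    name: str      # ASCII name, usually mnemonic
    design: tuple  # assumed encoding: 50 rows, each a 50-bit integer

    def __post_init__(self):
        if not 0 <= self.number < 2 ** NUMBER_BITS:
            raise ValueError("registry number does not fit in 17 bits")
        if not self.name.isascii():
            raise ValueError("name must be ASCII")
        if len(self.design) != MATRIX_SIZE:
            raise ValueError("design must have 50 rows")
        if any(not 0 <= row < 2 ** MATRIX_SIZE for row in self.design):
            raise ValueError("each row must fit in 50 bits")


class Registry:
    """Hypothetical registry that hands out numbers in order of registration."""

    def __init__(self):
        self._by_number = {}

    def register(self, name, design):
        number = len(self._by_number)  # assumption: sequential numbering
        entry = RegisteredCharacter(number, name, tuple(design))
        self._by_number[number] = entry
        return entry

    def lookup(self, number):
        return self._by_number[number]
```

Anyone may call `register`; the only checks are that the entry fits the three-part description given above.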
05600
05700 2. Besides the registry of characters, there is a registry of
05800 character sets, which different groups are using for different
05900 classes of documents. A registered character set has a registry
06000 number and a table giving the correspondence between the character
06100 codes as bit sequences and the registered character numbers.
06200
06300 3. Associated with a document is a statement of the character
06400 code used therein. This may be one of the registered codes or it may
06500 contain in addition modifications described by an auxiliary table
06600 giving the code correspondence with registered character numbers. A
06700 character code may have an escape character that says that the next
06800 character is described by its registry number. The statement of the
06900 character code may be a header on the document or the receiver may
07000 have to learn it by some other means, e.g. because its library
07100 catalog entry contains this information.
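
Items 2 and 3 can be sketched together: a registered character set is a table from codes to registry numbers, a document may modify it with an auxiliary table, and an escape code may introduce a registry number directly. The memo does not say how the number following the escape is encoded; the sketch assumes 7-bit codes, a hypothetical escape code of 127, and a registry number sent as three 7-bit codes, most significant first. The function names are likewise assumptions.

```python
ESCAPE = 0o177  # hypothetical choice of escape code (127)


def effective_table(registered_table, auxiliary=None):
    """Registered code table with the document's own modifications applied."""
    table = dict(registered_table)
    table.update(auxiliary or {})
    return table


def decode(codes, table, escape=ESCAPE):
    """Translate a sequence of 7-bit codes into registry numbers."""
    numbers = []
    stream = iter(codes)
    for code in stream:
        if code != escape:
            numbers.append(table[code])
            continue
        # Assumed encoding: the registry number follows the escape
        # as three 7-bit codes, most significant first.
        try:
            high, middle, low = next(stream), next(stream), next(stream)
        except StopIteration:
            raise ValueError("truncated escape sequence") from None
        number = (high << 14) | (middle << 7) | low
        if number >= 1 << 17:
            raise ValueError("registry number does not fit in 17 bits")
        numbers.append(number)
    return numbers
```

For example, with a registered table `{65: 1000, 66: 1001}` and an auxiliary table `{66: 2000}`, the codes `[65, 66]` decode to the registry numbers `[1000, 2000]`.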
07200
07300 4. Devices such as printers and displays draw characters in
07400 different ways and standardization doesn't seem feasible at present.
07500 Therefore, it is necessary to provide a way of going from the
07600 standard description of a character using a 50 by 50 dot matrix to
07700 whatever method the device uses. This is up to the programmers who
07800 are supporting the device. Some may choose to manually create files
07900 describing how registered characters are implemented. They may find
08000 it too much work to provide for all the characters and to update
08100 their files when new characters are registered. Others will provide
08200 programs for going from the registered descriptions to descriptions
08300 compatible with their implementations. Perhaps most will hand tailor
08400 the characters most used and provide a program for the others.
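
One way the mixed strategy at the end of item 4 might look: use a hand-tailored design when the installation has one, and otherwise derive a device design from the registered 50 by 50 matrix by program. The block-averaging rule, the threshold, and all names below are assumptions for illustration; a real device might need a quite different conversion.

```python
def scale_design(design, rows, cols, threshold=0.5):
    """Reduce a square dot matrix (lists of 0/1) to rows by cols device dots.

    A device dot is inked when at least `threshold` of the source dots
    falling in its cell are inked.
    """
    size = len(design)
    if not (0 < rows <= size and 0 < cols <= size):
        raise ValueError("device matrix must be no larger than the source")
    scaled = []
    for r in range(rows):
        r0, r1 = r * size // rows, (r + 1) * size // rows
        row = []
        for c in range(cols):
            c0, c1 = c * size // cols, (c + 1) * size // cols
            inked = sum(design[y][x]
                        for y in range(r0, r1) for x in range(c0, c1))
            total = (r1 - r0) * (c1 - c0)
            row.append(1 if inked >= threshold * total else 0)
        scaled.append(row)
    return scaled


def device_design(number, registry_designs, hand_tailored, rows, cols):
    """Prefer a hand-tailored design; fall back to automatic scaling."""
    if number in hand_tailored:
        return hand_tailored[number]
    return scale_design(registry_designs[number], rows, cols)
```

Characters registered after the hand-tailored file was last updated still get a usable, if less polished, design from the fallback.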
08500
08600 5. The easiest device to handle is the line printer because
08700 it is slow. At the beginning of the print job, the SPOOL program
08800 will look up the character set and load the printer's memory with the
08900 character designs used in the particular document. Sometimes, it may
09000 have to go through the network to one of the computers that stores
09100 the registry in order to find out what to do.
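
A minimal sketch of the spooler step in item 5, assuming the document has already been translated into registry numbers. The local store and the `fetch_remote` callable, which stands in for a request over the network to a computer holding the registry, are hypothetical.

```python
def designs_to_load(registry_numbers, local_designs, fetch_remote):
    """Collect the designs a print job needs, one per distinct character.

    local_designs: dict of designs already known at this installation.
    fetch_remote:  hypothetical callable that asks a registry host
                   on the network for a design it does not have.
    """
    needed = {}
    for number in registry_numbers:
        if number in needed:
            continue
        if number not in local_designs:
            # Remember the answer so later jobs avoid the network trip.
            local_designs[number] = fetch_remote(number)
        needed[number] = local_designs[number]
    return needed
```

Only the characters actually used in the document are loaded into the printer's memory, and the network is consulted only for designs the installation has not seen before.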
09200
09300 6. Display systems that have a character memory for each
09400 display unit can be handled in about the same way. Users will
09500 occasionally experience delays when the display programs are
09600 surprised by unfamiliar characters.
09700
09800 7. Display systems that share character memories require more
09900 complicated treatment. The object is to keep the memory large enough
10000 to keep all the characters that the current set of users is using and
10100 to handle the required table lookups from the different character
10200 codes in a nice way. There will be limitations on the diversity of
10300 character sets that can be in use simultaneously. Systems like the
10400 Datadisc that only look up the character when it is first written can
10500 be extended to work with large sets. Systems that have to look up
10600 each character code 30 times per second in order to maintain the
10700 display won't work so well.
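
For a system of the Datadisc kind, where a design is looked up only when a character is first written, the shared character memory in item 7 could be managed as a fixed number of slots with the least recently used design discarded when a new one is needed. This is one possible policy, not something the memo specifies; the class and the `fetch_design` callable are assumptions. It would not suit a system that must look up every code on each refresh, since a discarded design might still be on someone's screen.

```python
from collections import OrderedDict


class SharedCharacterMemory:
    """Fixed-size store of character designs shared among display units."""

    def __init__(self, slots, fetch_design):
        self.slots = slots
        self.fetch_design = fetch_design  # hypothetical source of designs
        self._loaded = OrderedDict()      # registry number -> design
        self.misses = 0                   # lookups that had to load a design

    def design_for(self, number):
        if number in self._loaded:
            self._loaded.move_to_end(number)  # mark as recently used
            return self._loaded[number]
        self.misses += 1
        design = self.fetch_design(number)
        if len(self._loaded) >= self.slots:
            self._loaded.popitem(last=False)  # discard least recently used
        self._loaded[number] = design
        return design
```

The `misses` counter gives a rough measure of how often users would see the delays mentioned in item 6, and the number of slots bounds the diversity of characters that can be in use at once.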
10800
10900 I have no special ideas about how to make keyboards adaptable
11000 to arbitrary sets. Each user may have to fend for himself.
11100
11200 In this memorandum so far, I have ignored typography, i.e.
11300 the fact that in printed documents the same letter may be printed in
11400 many fonts. Perhaps each character in each font will require a
11500 separate registered description, but with a constant difference
11600 between the numbers of the same character in different fonts.
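
The constant-difference idea can be made concrete with a little arithmetic. The stride of 2**17 below is a hypothetical choice, made so that the base character keeps its 17-bit registry number and the font occupies the bits above it; the memo does not fix a value, and the function names are assumptions.

```python
FONT_STRIDE = 1 << 17  # hypothetical constant difference between fonts


def font_number(base_number, font_index, stride=FONT_STRIDE):
    """Number of a character in a given font; font 0 is the base design."""
    if not 0 <= base_number < stride:
        raise ValueError("base number must be smaller than the stride")
    if font_index < 0:
        raise ValueError("font index must not be negative")
    return base_number + font_index * stride


def split_font_number(number, stride=FONT_STRIDE):
    """Recover (base_number, font_index) from a font-qualified number."""
    font_index, base_number = divmod(number, stride)
    return base_number, font_index
```

An installation that does not implement a given font distinction can print the base character by keeping only the first element of `split_font_number(number)`.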
11700 Installations will again have to decide what font distinctions they
11800 will implement.
11900
12000 Another issue that might be considered is whether means
12100 can be provided to adapt texts automatically to the line and page
12200 lengths of the different devices.
12300
12400 It seems to me most likely that the typographical problems
12500 cannot be solved at this time, and that it would be best to adopt
12600 conventions for registering character designs now and leave
12700 typography for later.
12800
12900 In my opinion, there is no real obstacle to establishing the
13000 registry in the ARPA network now, getting the standards organizations
13100 to work, and being able to exchange documents in extended character
13200 sets as soon as the various installations can acquire the printers
13300 and display devices.
13400
13500 It is the present policy of the Stanford Artificial
13600 Intelligence Laboratory to acquire no more devices that are wedded to
13700 fixed character sets.